FIR filter optimization for video processing on FPGAs

نویسندگان

Martin Kumm

Diana Fanghänel

Konrad Möller

Peter Zipf

Uwe Meyer-Bäse

چکیده

Two-dimensional finite impulse response (FIR) filters are an important component in many image and video processing systems. The processing of complex video applications in real time requires high computational power, which can be provided using field programmable gate arrays (FPGAs) due to their inherent parallelism. The most resource-intensive components in computing FIR filters are the multiplications of the folding operation. This work proposes two optimization techniques for high-speed implementations of the required multiplications with the least possible number of FPGA components. Both methods use integer linear programming formulations which can be optimally solved by standard solvers. In the first method, a formulation for the pipelined multiple constant multiplication problem is presented. In the second method, also multiplication structures based on look-up tables are taken into account. Due to the low coefficient word size in video processing filters of typically 8 to 12 bits, an optimal solution is found for most of the filters in the benchmark used. A complexity reduction of 8.5% for a Xilinx Virtex 6 FPGA could be achieved compared to state-of-the-art heuristics. Introduction Two-dimensional linear filters with finite impulse response (FIR) are one of the most fundamental operations used in image and video processing. They are used, e. g., in applications which contain contrast improvement, denoising, sharpening, target matching, and feature enhancement [1]. Compared to infinite impulse response filters, FIR filters have a strict stability, and highthroughput implementations are easily possible using pipelining as no recursions are involved. However, they are computationally expensive as many multiply accumulate (MAC) operations are necessary for each pixel of the resulting image. While this is very demanding for a microprocessor or digital signal processor, the inherent parallelism of field programmable gate arrays (FPGAs) can be used to accelerate the FIR operation. Modern FPGAs directly incorporate embedded multipliers or DSP blocks which also include preand postadders for MAC operations. Xilinx’s DSP blocks (Xilinx Inc., San Jose, CA, USA) of Virtex 5/6, Spartan 6, and the 7 series FPGAs provide 18×25-bit signedmultipliers. More flexible are the variable precision DSP blocks of the latest *Correspondence: [email protected] 1Digital Technology Group, University of Kassel, Kassel 34121, Germany Full list of author information is available at the end of the article FPGAs of Altera, the Stratix V, Cyclone V, and Aria V devices (Altera, San Jose, CA, USA). Each DSP block can be configured as three independent 9 × 9-bit multipliers, two independent 16 × 16-bit, 15 × 17-bit, or 14 × 18-bit multipliers, or a single 18×36-bit or 27×27-bit multiplier. However, embedded multipliers and DSP blocks are limited resources even on modern low-cost FPGAs, and they may have a higher power consumption compared to constant multiplication using the carry-chain resources [2]. Especially in image processing, embedded multipliers are often underutilized because of the small word lengths used. Typically, only 8 bits per color and 10 bits for a luminance representation are used. In most filter applications, the coefficients are fixed, which can be used to reduce the complexity of the multiplication. Furthermore, partial results can be shared inside a single multiplier or between multipliers of different constants to reduce hardware resources. Different methods have been proposed over the years for such multiple constant multiplications (MCM): (a) MCM using additions, subtractions, and bit shifts [3-33]; (b) MCM using look-up tables (LUTs) and adders [34,35]; (c) Distributed arithmetic (DA) [36-42]. © 2013 Kumm et al.; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Kumm et al. EURASIP Journal on Advances in Signal Processing 2013, 2013:111 Page 2 of 18 http://asp.eurasipjournals.com/content/2013/1/111 In method (a), constant multiplications are realized using additions, subtractions, and bit shifts only. These operations form a so-called adder graph, so this method is called the adder graph MCM method in the following. It was originally developed for software or VLSI applications [3] but also maps well to the fast carry chains of FPGAs. In method (b), the input word is split into smaller chunks that fit into the input word size of the FPGA LUTs. These LUT results are shifted and added afterward to form the multiplication result. LUTs and adders are also used in method (c), but there the folding equation of the FIR filter is rearranged in such a way that identical LUTs can be used. This is very beneficial in sequential FIR implementations but costly in parallel implementations. Because it was shown in the recent years that multiplier blocks using add, subtract, and shift operations (method a) consume considerable less logic resources compared to parallel DA implementations [5-7], the DA approach is not further considered. Due to the relatively large routing delays compared to the fast carry chain, a pipelined implementation of the adder graph is necessary to obtain the maximum speed of the FPGA [2,5-10]. It was shown by Faust et al. [35] that the LUT-based approach (method b) is competitive to the adder graph method. Thus, pipelined circuits using the combination of methods (a) and (b) are investigated in this paper. Contribution of this work The main contribution of this article is the description of two novel optimization methods, one for the adder graph MCM problem including pipelining (the pipelined MCM problem [9]) and one for a combination of this method with a pipelined realization of the LUT-based method mentioned above. Eachmethod is formulated as a boolean integer linear program (BILP, or 0-1 ILP) and then reduced to a mixed integer linear program (MILP). Hence, if the MILP solver finds a solution in reasonable time, an optimal solution for the given cost model is found. To the best of our knowledge, this is the first time an optimal method for solving the pipelinedMCM (PMCM) problem is proposed. The complexity of the adder graph MCMmethod heavily depends on the coefficient values, while the complexity of the LUT-based approach mainly depends on the input word size. Therefore, sometimes, one method or the other delivers better results. For this, a combination of both methods is proposed in this work by incorporating the LUT-based multipliers in the integer linear programming (ILP) formulation of the PMCM problem. Due to the low coefficient word size of typically 8 to 12 bits in image processing, a short convergence time of the ILP solver is very likely, which makes the proposed optimization an ideal candidate for image processing. The remaining of this paper is organized as follows. The related work is discussed in the next section, followed by an introduction of the used FIR architectures for image processing. Then, an ILP formulation for the PMCM problem is described which is later extended for additional LUT-based multiplication. Finally, results from the optimizations and FPGA synthesis are presented and discussed, followed by a conclusion. Related work MCM using additions, subtractions, and bit shifts Different methods have been proposed over the years to realize constant multiplication using additions, subtractions, and bit shifts only. Finding the optimal configuration of these operations is known asMCMproblem, which has been an active research topic for almost the last two decades [3-33]. The objective is usually defined by minimizing the number of adders and subtractors (shifts are assumed to be free, as they can be implemented using wires). An example adder graph which realizes a multiplier block with the constants of the set {44, 130, 172} is shown in Figure 1a. Each node in the graph corresponds to an adder or subtractor, indicated by ‘+’ or ‘−.’ The numeric node value represents the realized multiple of the input, i. e., node ‘1’ corresponds to the input of the MCM block. To have a unique representation, all node values are defined to be odd. They can be made even by a simple bit shift as shown at the output. All edge weights represent left shifts, e. g., node ‘5’ is realized by left shifting the input x by 2 bits and adding the unshifted input: 22x + 20x = 5x. Right shifts are represented by negative edge weights. The MCM problem is NP complete [4]. Hence, most of the proposed algorithms are heuristics, and less work was directed toward optimal solutions. Early work was done by Bull and Horrocks [3] which was later extended by Dempster and Macleod to the modified Bull and Horrocks algorithm [14]. In the same work, the ndimensional reduced adder graph (RAG-n) algorithm was proposed which was one of the leading heuristics for years. Major improvements could be achieved by the work of Voronenko and Püschel with their cumulative benefit heuristic (Hcub) [4]. By spending a bit more algorithmic complexity and evaluating adder graph topologies up to a depth of three, they could reduce the required additions/subtractions by 20% on average compared to RAGn. A competing approach based on difference graphs was proposed by Gustafsson [16]. It tends to be beneficial compared to Hcub in case large coefficient sets and/or low coefficient word lengths are used but may be worse in other cases. Many approaches use ILP formulations, for which optimal solvers exist. However, the search space is often Kumm et al. EURASIP Journal on Advances in Signal Processing 2013, 2013:111 Page 3 of 18 http://asp.eurasipjournals.com/content/2013/1/111

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Implementing High-Order FIR Filters in FPGAs

Contemporary field-programmable gate arrays (FPGAs) are predestined for the application of finite impulse response (FIR) filters. Their embedded digital signal processing (DSP) blocks for multiply-accumulate operations enable efficient fixed-point computations, in cases where the filter structure is accurately mapped to the dedicated hardware architecture. This brief presents a generic systolic...

متن کامل

Comparison Between FPGA Logic Resources And Embedded Resources Used By Discrete Arithmetic (DA) Architecture To Design FIR Filter

FIR Filter is a fundamental function in many applications, mainly in the field of Digital Signal Processing (DSP). The main problem of this function is its computational complexity, needed to process a signal. Currently, FPGAs are widely used for a variety of computationally complex applications. New generation of FPGAs contain not only basic configurable logic resources but also complex embedd...

متن کامل

A Novel FIR Filter Architecture for Efficient Signal Boundary Handling on Xilinx VIRTEX FPGAs

FIR filters are often used in digital signal processing. This paper presents a novel architecture for FIR filters on Xilinx Virtex FPGAs. The architecture is particularly useful for handling the problem of signal boundaries filtering, which occurs in finite length signal processing (e.g. image processing). Based on a bit parallel arithmetic, our architecture is fully scalable and parameterised....

متن کامل

Transposed Form FIR Filters

This application note describes a high-speed, reconfigurable, full-precision Transposed Form FIR filter design implemented in the VirtexTM and Virtex-II series and SpartanTM-II family of FPGAs. The VHDL reference design provided with this application note is easily modified to change filter parameters including coefficients and the number of taps. By illustrating a design methodology for digita...

متن کامل

Design of High- speed FIR filter Based on Booth Radix-8 Multiplier Implemented on FPGA

Finite Impulse Response (FIR) digital filters have potential for high-speed and low-power realization through parallel processing on FPGA. In this paper, an efficient implementation of FIR filters, which uses a Booth Radix-8 multiplier, is suggested. For implementation of the said FIR filter MATLAB FDATool is employed to determine various filter coefficients. The 8 order FIR filters have been d...

متن کامل

Dynamic Partial Reconfigurable FIR Filter Design

This paper presents a novel partially reconfigurable FIR filter design that employs dynamic partial reconfiguration. Our scope is to implement a low-power, area-efficient autonomously reconfigurable digital signal processing architecture that is tailored for the realization of arbitrary response FIR filters using Xilinx FPGAs. The implementation of design addresses area efficiency and flexibili...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

EURASIP J. Adv. Sig. Proc.

دوره 2013 شماره

صفحات -

تاریخ انتشار 2013

FIR filter optimization for video processing on FPGAs

نویسندگان

چکیده

منابع مشابه

Implementing High-Order FIR Filters in FPGAs

Comparison Between FPGA Logic Resources And Embedded Resources Used By Discrete Arithmetic (DA) Architecture To Design FIR Filter

A Novel FIR Filter Architecture for Efficient Signal Boundary Handling on Xilinx VIRTEX FPGAs

Transposed Form FIR Filters

Design of High- speed FIR filter Based on Booth Radix-8 Multiplier Implemented on FPGA

Dynamic Partial Reconfigurable FIR Filter Design

عنوان ژورنال:

اشتراک گذاری